Thematic Segmentation of Texts: Two Methods for Two Kinds of Text
Authors
Abstract
To segment texts into thematic units, we show how a basic principle that relies on word distribution can be applied to different kinds of texts. We start from an existing method that is well suited to scientific texts, and we propose an adaptation of it to other kinds of texts that uses semantic links between words. These relations are found in a lexical network built automatically from a large corpus. We compare the results of the two methods and give criteria for choosing the more suitable one according to the characteristics of the text.
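As an illustration of the word-distribution principle only, the sketch below places a boundary wherever the vocabulary of the sentences before a gap has little overlap with the vocabulary of the sentences after it. It is not the authors' method: the cosine measure, the `window` and `threshold` parameters, and all function names are assumptions made for this example, and the lexical-network adaptation mentioned in the abstract is not modelled.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def gap_similarities(sentences, window=2):
    """Similarity between the `window` sentences before and after each gap."""
    bags = [Counter(tokenize(s)) for s in sentences]
    sims = []
    for gap in range(1, len(bags)):
        # Merge the word counts of the sentences on each side of the gap.
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        sims.append(cosine(left, right))
    return sims


def boundaries(sentences, window=2, threshold=0.1):
    """Return each index i such that a new segment starts at sentence i.

    The threshold is an arbitrary illustrative value (assumption).
    """
    sims = gap_similarities(sentences, window)
    return [i + 1 for i, s in enumerate(sims) if s < threshold]
```

Calling `boundaries(sentences)` on a list of sentence strings returns the indices where a new thematic segment is assumed to start. A fixed threshold is the simplest possible decision rule; texts with little word repetition would likely need the semantic links that the abstract describes.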
Similar resources
A Topic Segmentation of Texts based on Semantic Domains
LIMSI-CNRS, BP 133, 91403 Orsay Cedex, France. Email: [ferret,grau]@limsi.fr. Abstract: Thematic analysis is essential for many Natural Language Processing (NLP) applications, such as text summarization or information extraction. It is a two-dimensional process that has both to delimit the thematic segments of a text and to identify the topic of each of them. The system we present possesses th...
A Thematic Segmentation Procedure for Extracting Semantic Domains from Texts
Thematic analysis is essential for a lot of Natural Language Processing (NLP) applications, such as text summarization or information extraction. It is a two-dimensional process which has both to identify the thematic segments of a text and to recognize the semantic domain concerned by each of them. This second task requires having a representation of these domains. Such representations are bui...
SegGen: A Genetic Algorithm for Linear Text Segmentation
This paper describes SegGen, a new algorithm for linear text segmentation on general corpuses. It aims to segment texts into thematic homogeneous parts. Several existing methods have been used for this purpose, based on a sequential creation of boundaries. Here, we propose to consider boundaries simultaneously thanks to a genetic algorithm. SegGen uses two criteria: maximization of the internal...
The Compilation of Urbanism Texts by Using the Iranian's-Valuable Texts (With Emphasis on the Islamic Ethics)
It is clear that each community should have its own urbanism science. Science localization is an obvious matter. This matter has motivated Iranian researchers in the urbanism field to naturalize the urbanism science that has been imported to Iran. One method for producing or indigenizing urbanism texts in Iran, especially in recent years, is the utilization of Iranian-valuable texts. There are high...
Unsupervised Learning with Term Clustering for Thematic Segmentation of Texts
In this paper we introduce a machine learning approach for automatic text segmentation. Our text segmenter clusters text-segments containing similar concepts. It first discovers the different concepts present in a text, each concept being defined as a set of representative terms. After that the text is partitioned into coherent paragraphs using a clustering technique based on the Classification...